Computer-Assisted Keyword and Document Set Discovery from Unstructured Text
Abstract
The (unheralded) first step in many applications of automated text analysis involves selecting keywords to choose documents from a large text corpus for further study. Although all substantive results depend crucially on this choice, researchers typically pick keywords in ad hoc ways, given the lack of formal statistical methods to help. Paradoxically, this often means that the validity of the most sophisticated text analysis methods depends in practice on the inadequate keyword counting or matching methods they are designed to replace. The same ad hoc keyword selection process is also used in many other areas, such as following conversations that rapidly innovate language to evade authorities, seek political advantage, or express creativity; generic web searching; eDiscovery; look-alike modeling; intelligence analysis; and sentiment and topic analysis. We develop a computer-assisted (as opposed to fully automated) statistical approach that suggests keywords from available text, without needing any structured data as inputs. This framing poses the statistical problem in a new way, which leads to a widely applicable algorithm. Our specific approach is based on training classifiers, extracting information from (rather than correcting) their mistakes, and then summarizing results with Boolean search strings. We illustrate how the technique works with examples in English and Chinese.
∗ Our thanks to Dan Gilbert for helpful suggestions.
† Institute for Quantitative Social Science, 1737 Cambridge Street, Harvard University, Cambridge, MA 02138; GKing.harvard.edu, (617) 500-7570.
‡ Institute for Quantitative Social Science, 1737 Cambridge Street, Harvard University, Cambridge, MA 02138; www.patricklam.org
§ Department of Government, 1737 Cambridge Street, Harvard University, Cambridge, MA 02138; scholar.harvard.edu/mroberts
Similar resources
Keyword-Based Browsing and Analysis of Large Document Sets
Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. This paper describes the KDT system for Kno...
Discriminative Features Selection in Text Mining Using TF-IDF Scheme
This paper describes a technique for discriminative feature selection in text mining. "Text mining" is the discovery of new, previously unknown information by computer. Discriminative features are the most important keywords or terms in a document collection, which describe the informative news included in the document collection. The generated keyword set is used to discover Association Rules am...
Efficient Text and Semi-structured Data Mining: Knowledge Discovery in the Cyberspace
This paper describes applications of the optimized pattern discovery framework to text and Web mining. In particular, we introduce a class of simple combinatorial patterns over texts such as proximity phrase association patterns and ordered and unordered tree patterns modeling unstructured texts and semi-structured data on the Web. Then, we consider the problem of finding the patterns that opti...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...
Modified Approach of Multinomial Naïve Bayes for Text Document Classification
This work proposes text classification using a modified approach of Multinomial Naïve Bayes for justifying and identifying the documents into a particular category. Due to the exploration of the textual information from the electronic digital documents as well as the World Wide Web. The Naïve Bayes theorem is effective for classification of text documents into the predefined categories by means of the ...